Exploratory data analysis (EDA) is an approach to data analysis that summarises and visualises the important characteristics of a data set. EDA is not a formal process with a strict set of rules, but it is an important part of any data analysis because you always need to investigate the quality of your data.

Objective

Your goal during EDA is to develop an understanding of your data.


Example of EDA

Univariate

Univariate Quantitative
Measures of central tendency
  • Mean
  • Median
  • Mode
Measures of dispersion
  • Min
  • Max
  • Range
  • Quartiles
  • Variance
  • Standard deviation
Other measures include
  • Skewness
  • Kurtosis
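
The measures listed above can be computed with base R. The following is a minimal sketch, assuming the `diamonds` data from ggplot2 (introduced later in this document) and using `price` as the example variable. Base R has no built-in statistical mode, skewness, or kurtosis, so the helper expressions here are illustrative; skewness and kurtosis use population moments, which is only one of several common definitions.

```r
library(ggplot2)  # provides the diamonds dataset

x <- diamonds$price

# Measures of central tendency
central <- c(
  mean   = mean(x),
  median = median(x),
  mode   = as.numeric(names(which.max(table(x))))  # most frequent value
)

# Measures of dispersion
dispersion <- c(
  min      = min(x),
  max      = max(x),
  range    = max(x) - min(x),
  variance = var(x),
  sd       = sd(x)
)
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Shape measures based on population moments (an assumed definition)
m  <- mean(x)
s2 <- mean((x - m)^2)
skewness <- mean((x - m)^3) / s2^1.5
kurtosis <- mean((x - m)^4) / s2^2

print(central)
print(dispersion)
print(quartiles)
print(c(skewness = skewness, kurtosis = kurtosis))
```

A positive skewness indicates a longer right tail, which is what the price histogram later in this document shows.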
Univariate Graphical
  • Histogram
  • Box plots
  • Bar plots
  • Kernel density plots

Bivariate

Bivariate Quantitative
Bivariate analysis includes:

  • Crosstabs
  • Covariance
  • Correlation

  • Cluster analysis
  • Analysis of variance (ANOVA)
  • Factor analysis
  • Principal component analysis (PCA)
Bivariate Graphical
Graphical techniques include:

  • Scatterplot
  • Box plot
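
A few of the bivariate techniques above can be illustrated briefly. This is a sketch only, assuming the `diamonds` data from ggplot2; the variable pairs (`cut` with `color`, `price` with `carat`) are chosen for illustration and are not prescribed by the list.

```r
library(ggplot2)  # provides the diamonds dataset and plotting functions

# Crosstab of two categorical variables
cut_by_color <- table(diamonds$cut, diamonds$color)
print(cut_by_color)

# Covariance and Pearson correlation of two numeric variables
cov_pc <- cov(diamonds$price, diamonds$carat)
cor_pc <- cor(diamonds$price, diamonds$carat)
print(c(covariance = cov_pc, correlation = cor_pc))

# Scatterplot of the same pair; alpha reduces overplotting
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)
print(p)
```

Covariance depends on the units of both variables, so the correlation is usually the easier number to interpret.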

Diamonds Dataset

This dataset contains diamond prices and their characteristics. The variables used are as follows.

Variable  Description
price     price in US dollars ($326-$18,823)
carat     weight of the diamond (0.2-5.01)
cut       quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color     diamond colour, from J (worst) to D (best)
clarity   a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x         length in mm (0-10.74)
y         width in mm (0-58.9)
z         depth in mm (0-31.8)
depth     total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43-79)
table     width of top of diamond relative to widest point (43-95)
library(ggplot2)  # provides the diamonds dataset
library(DT)       # provides datatable()
data("diamonds")
datatable(diamonds)
summary(diamonds)
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 
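
The variable table defines depth as 2 * z / (x + y), and the summary above shows a minimum of 0 for x, y, and z. The following sketch checks the stored `depth` against that formula. Treating rows with a zero dimension as recording errors and dropping them is an assumption made here, not something the data description states.

```r
library(ggplot2)  # provides the diamonds dataset

# Keep rows with positive dimensions (assumption: zeros are recording errors)
d <- subset(diamonds, x > 0 & y > 0 & z > 0)

# depth is stored as a percentage, hence the factor of 100
depth_calc <- 100 * 2 * d$z / (d$x + d$y)
abs_diff   <- abs(depth_calc - d$depth)

print(summary(abs_diff))
print(nrow(diamonds) - nrow(d))  # number of rows dropped
```

Small differences are expected because x, y, z, and depth are all rounded; large differences point to rows worth inspecting.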
Missing values

The following plot checks for missing values.

library(naniar)
gg_miss_var(diamonds, show_pct = TRUE)

There are no missing values in the diamonds data.
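
The same check can be done numerically with base R, as a complement to the plot. This sketch assumes the `diamonds` data from ggplot2. The count of zero dimensions is an extra check added here: a value of 0 mm may be a missing value recorded as zero rather than as `NA`, which is an interpretation, not a documented fact.

```r
library(ggplot2)  # provides the diamonds dataset

# Number of NA values in each column
na_per_column <- colSums(is.na(diamonds))
print(na_per_column)

# Rows where at least one dimension is recorded as 0 mm
n_zero_dim <- sum(diamonds$x == 0 | diamonds$y == 0 | diamonds$z == 0)
print(n_zero_dim)
```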

Visualization

The following shows the correlations between the numeric variables in the diamonds data.

library(corrplot)  # provides corrplot()
num <- diamonds[c('price', 'carat', 'x', 'y', 'z', 'depth', 'table')]
corrplot(cor(num), type = "full", method = "square")

Several variables are multicollinear, namely carat, x, y, and z. The price variable has a fairly high positive correlation with carat, x, y, and z.
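
The correlation plot can be backed up with the numbers themselves. The following is a minimal sketch, assuming the `diamonds` data from ggplot2; `size_vars` is a name introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

size_vars <- c("carat", "x", "y", "z")
cor_mat   <- cor(diamonds[c("price", size_vars)])

# Correlations among the size-related variables
print(round(cor_mat[size_vars, size_vars], 2))

# Correlation of price with each size-related variable
print(round(cor_mat["price", size_vars], 2))
```

Because carat, x, y, and z carry largely the same information, a regression model would typically keep only one of them or combine them.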

Price

The following is a histogram of the response variable, price.

# after_stat(count) replaces the deprecated ..count.. notation
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price, fill = after_stat(count)), binwidth = 900) +
  scale_y_continuous(name = "Frequency") +
  scale_x_continuous(name = "Price ($US)") +
  ggtitle("Frequency histogram of diamond price ($US)")

The histogram is right-skewed, so the price variable is not normally distributed. Prices are concentrated at the low end: the median is $2,401, and most diamonds cost less than US$5,000. A transformation can be applied to the price variable.
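
One common choice for right-skewed, strictly positive data is a logarithmic transformation. The text does not say which transformation was intended, so the following is only a sketch under that assumption. The `skew` helper is defined here for illustration and uses population moments.

```r
library(ggplot2)  # provides the diamonds dataset and plotting functions

# Moment-based skewness (illustrative helper, not from a package)
skew <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

skew_raw <- skew(diamonds$price)
skew_log <- skew(log10(diamonds$price))
print(c(raw = skew_raw, log10 = skew_log))

# Histogram of the transformed variable
p <- ggplot(diamonds, aes(x = log10(price))) +
  geom_histogram(bins = 30) +
  labs(x = "log10(Price in $US)", y = "Frequency")
print(p)
```

A skewness closer to zero after the transformation suggests a more symmetric distribution, although symmetry alone does not make it normal.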

Carat

The following is a histogram of the carat variable.

# The y-axis shows counts, so it is labelled "Frequency"
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat, fill = after_stat(count)), binwidth = 0.4) +
  scale_y_continuous(name = "Frequency") +
  scale_x_continuous(name = "Carat of the diamond") +
  ggtitle("Frequency histogram of diamond carat")

The histogram is right-skewed, so the carat variable is not normally distributed. Carat ranges from 0.2 to 5.01, but values above 3 are not visible in the histogram because their frequencies are too small. Zooming in on the y-axis makes them visible, as in the following histogram.

# coord_cartesian() zooms the y-axis without dropping any observations
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat, fill = after_stat(count)), binwidth = 0.4) +
  scale_y_continuous(name = "Frequency") +
  scale_x_continuous(name = "Weight of the diamond") +
  ggtitle("Frequency histogram of diamond weight") +
  coord_cartesian(ylim = c(0, 100))

Alternatively, the counts can be tabulated as follows.

library(dplyr)  # provides %>% and count()
diamonds %>% count(cut_width(carat, 0.4))
## # A tibble: 12 x 2
##    `cut_width(carat, 0.4)`     n
##    <fct>                   <int>
##  1 [0.2,0.6]               24448
##  2 (0.6,1]                 11990
##  3 (1,1.4]                 11093
##  4 (1.4,1.8]                4135
##  5 (1.8,2.2]                1812
##  6 (2.2,2.6]                 395
##  7 (2.6,3]                    35
##  8 (3,3.4]                    22
##  9 (3.4,3.8]                   4
## 10 (3.8,4.2]                   4
## 11 (4.2,4.6]                   1
## 12 (5,5.4]                     1
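
The table shows how few diamonds weigh more than 3 carats. The following sketch counts them directly and expresses the count as a share of the data, assuming the `diamonds` data from ggplot2. The names `n_heavy` and `share_heavy` are introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

n_heavy     <- sum(diamonds$carat > 3)      # diamonds heavier than 3 carats
share_heavy <- n_heavy / nrow(diamonds)     # as a proportion of all rows
print(c(count = n_heavy, share = share_heavy))
```

The count should agree with the sum of the rows above 3 carats in the table, which explains why these bins are invisible on the full-scale histogram.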
Boxplot

The following are boxplots for each numeric variable other than price.

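library(reshape2)  # provides melt()
# num[, -1] drops price; melt() reshapes the remaining columns
# to long format (columns: variable, value)
ggplot(data = melt(num[, -1]), aes(x = variable, y = value)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = 'free')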

Boxplots of price can also be drawn for each level of a categorical variable.

kat <- diamonds[c('cut', 'color', 'clarity', 'price')]

ggplot(data = kat, aes(x=cut, y=price)) +
  geom_boxplot()
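
The boxplot by cut can be complemented with a numeric summary, and the same plot can be drawn for the other categorical variables. The following is a sketch, assuming the `diamonds` data from ggplot2; using the median as the summary statistic is a choice made here because price is right-skewed.

```r
library(ggplot2)  # provides the diamonds dataset and plotting functions

# Median price for each level of cut
median_by_cut <- aggregate(price ~ cut, data = diamonds, FUN = median)
print(median_by_cut)

# Same kind of boxplot for the other two categorical variables
p_color   <- ggplot(diamonds, aes(x = color, y = price)) + geom_boxplot()
p_clarity <- ggplot(diamonds, aes(x = clarity, y = price)) + geom_boxplot()
print(p_color)
print(p_clarity)
```

Differences in median price between levels should be read with care, since carat is strongly correlated with price and is not held fixed in these plots.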